Name: Sandeep Kumar Yedla G Number:G01299433
This semester we will be working with a dataset of all domestic outbound flights from Dulles International Airport in 2016.
Airports depend on accurate flight departure and arrival estimates to maintain operations, profitability, customer satisfaction, and compliance with state and federal laws. Flight performance, including departure and arrival delays must be monitored, submitted to the Federal Aviation Agency (FAA) on a regular basis, and minimized to maintain airport operations. The FAA considered a flight to be delayed if it has an arrival delay of at least 15 minutes.
The executives at Dulles International Airport have hired you as a Data Science consultant to perform an exploratory data analysis on all domestic flights from 2016 and produce an executive summary of your key insights and recommendations to the executive team.
Before you begin, take a moment to read through the following airline flight terminology to familiarize yourself with the industry: Airline Flight Terms
The flights_df data frame is loaded below and consists
of 33,433 flights from IAD (Dulles International) in 2016. The rows in
this data frame represent a single flight with all of the associated
features that are displayed in the table below.
Note: If you have not installed the
tidyverse package, please do so by going to the
Packages tab in the lower right section of RStudio, select
the Install button and type tidyverse into the
prompt. If you cannot load the data, then try downloading the latest
version of R (at least 4.0). The readRDS() function has
different behavior in older versions of R and may cause
loading issues.
## importing required libraries
library(tidyverse)
library(skimr)
library(dplyr)
library(plotly);
library(ggplot2)
library(paletteer)
library(corrplot)
library(RColorBrewer)
# importing the flight data
flights_df <- readRDS(url('https://gmubusinessanalytics.netlify.app/data/dulles_flights.rds'))
flights_df
skim(flights_df)
| Name | flights_df |
| Number of rows | 33433 |
| Number of columns | 22 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 8 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| scheduled_flight_date | 0 | 1 | 2016-01-01 | 2016-12-31 | 2016-07-12 | 364 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month | 0 | 1 | FALSE | 12 | Jul: 3120, Aug: 3092, Jun: 3071, Oct: 3051 |
| weekday | 0 | 1 | FALSE | 7 | Fri: 5015, Thu: 4993, Wed: 4973, Tue: 4917 |
| airline | 0 | 1 | FALSE | 10 | Uni: 20653, Ame: 2597, Del: 2565, Sou: 2161 |
| tail_num | 0 | 1 | FALSE | 2463 | N63: 125, N69: 113, N66: 112, N66: 109 |
| dest_airport_name | 0 | 1 | FALSE | 40 | San: 4034, Los: 3846, Den: 3628, Har: 3154 |
| dest_airport_city | 0 | 1 | FALSE | 36 | San: 4034, Los: 3846, Den: 3628, Atl: 3154 |
| dest_airport_state | 0 | 1 | FALSE | 26 | Cal: 9177, Col: 3657, Flo: 3511, Geo: 3154 |
| dest_airport_region | 0 | 1 | FALSE | 6 | Wes: 15555, Sou: 7752, Nor: 4002, Mid: 3040 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| month_numeric | 0 | 1 | 6.76 | 3.33 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 | ▆▅▆▆▇ |
| day | 0 | 1 | 15.72 | 8.78 | 1.00 | 8.00 | 16.00 | 23.00 | 31.00 | ▇▇▇▆▆ |
| flight_num | 0 | 1 | 1049.63 | 1022.03 | 10.00 | 365.00 | 685.00 | 1544.00 | 6840.00 | ▇▂▁▁▁ |
| sch_dep_time | 0 | 1 | 14.00 | 4.96 | 5.33 | 8.83 | 14.90 | 17.75 | 22.95 | ▆▃▃▇▃ |
| dep_time | 0 | 1 | 14.08 | 5.07 | 0.02 | 8.87 | 14.92 | 17.98 | 24.00 | ▁▆▃▇▃ |
| dep_delay | 0 | 1 | 9.07 | 41.69 | -25.00 | -5.00 | -2.00 | 3.00 | 1244.00 | ▇▁▁▁▁ |
| taxi_out | 0 | 1 | 16.95 | 10.80 | 1.00 | 11.00 | 14.00 | 18.00 | 159.00 | ▇▁▁▁▁ |
| wheels_on | 0 | 1 | 14.87 | 5.72 | 0.02 | 10.53 | 15.08 | 19.85 | 24.00 | ▁▃▇▆▇ |
| taxi_in | 0 | 1 | 8.87 | 6.68 | 1.00 | 5.00 | 7.00 | 10.00 | 178.00 | ▇▁▁▁▁ |
| arrival_time | 0 | 1 | 14.89 | 5.79 | 0.02 | 10.60 | 15.12 | 19.97 | 24.00 | ▂▃▇▆▇ |
| sch_arrival_time | 0 | 1 | 14.94 | 5.71 | 0.02 | 10.77 | 15.28 | 20.00 | 23.98 | ▁▃▇▆▇ |
| arrival_delay | 0 | 1 | -0.55 | 45.47 | -94.00 | -20.00 | -11.00 | 2.00 | 1228.00 | ▇▁▁▁▁ |
| distance | 0 | 1 | 1354.55 | 839.61 | 157.00 | 534.00 | 1190.00 | 2288.00 | 4817.00 | ▇▃▆▁▁ |
#View(flights_df)
Executives at this company have hired you as a data science consultant to evaluate their flight data and make recommendations on flight operations and strategies for minimizing flight delays.
You must think of at least 8 relevant questions that will provide evidence for your recommendations.
The goal of your analysis should be discovering which variables drive the differences between flights that are early/on-time vs. flights that are delayed.
Some of the many questions you can explore include:
Are flight delays affected by taxi-out time? Do certain airlines or time of year lead to greater taxi out times (i.e. traffic jams on the runways)?
Are certain times of the day or year problematic?
Are certain destination or airlines prone to delays?
You must answer each question and provide supporting data summaries
with either a summary data frame (using
dplyr/tidyr) or a plot (using
ggplot) or both.
In total, you must have a minimum of 5 plots and 4 summary data frames for the exploratory data analysis section. Among the plots you produce, you must have at least 4 different types (ex. box plot, bar chart, histogram, heat map, etc…)
Each question must be answered with supporting evidence from your tables and plots.
## Subsetting the data with <=15 as less than 15 minutes are considered late
delayed_data <-flights_df %>%
filter(arrival_delay >=15)
delayed_data
## subsetting ontime data for furthur use
ontime_data <-flights_df %>%
filter(arrival_delay<15)
ontime_data
#View(delayed_data)
Are the months(season change or weather) affecting the flight arrival at destination?
Answer: Yes, We can see there is difference in flight arrival during few months, Some particlular months have more flight delays such as in july there are around 748 flights, in june there are 682, August 550 flights delayed, probably it must be summer such that there are more flights operating and due to the air traffic the flights are delayed, also in December we can see that there are 670 flights delayed probably december is winter and the snow or weather is causing the delay. if weather data is available we can analayze in more deatils. The pie chart descibes the percentages of delays occure from the dalayed flight data
Suggested Recommendation The flights schedule must be changed with analyzing the time they are departed from the below Do-nut plots, so that the flight delays are minimized.
weather<-select(delayed_data,month_numeric, airline, arrival_delay, distance,month)
weather
month_count<-weather %>% count(month, name = 'Dealys_in_each_month',sort = TRUE)
month_count
To add additional R code chunks for your work, select
Insert then R from the top of this notebook
file.
fig <- plot_ly(month_count, labels = ~month, values = ~Dealys_in_each_month, type = 'pie')
fig
fig <- fig %>% layout(title = 'Percentage of Delays occured in months',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))%>%
layout(xaxis = list(tickfont = list(size = 15)), yaxis = list(tickfont = list(size = 5)));
fig
#####plot 2
Question
Are the hours or departing hour is affecting the flight arrival at destination due to air traffic?
Answer:
Yes, most of the delayed flgihts are departed during the hours around 16:30 (4:30 P.m) - 21:00 hrs (9:00 P.m), as most of the flights are departed during these hours. It is possible that due to sunset or less light the arrival time is getting affected.
Suggested Recommendation From the below analysis we can recommend that the flight departure schedule can be changed and some of the flights can be scheduled during the 0 (12:00 a.m) - 5 a.m so that the spread of the flights departure is even round the clock which can show impact on flights arrival.
ggplot(delayed_data, aes(x =dep_time)) +
geom_histogram(aes(y = ..density..),bins = 50,
color = "gray", fill = "white") +
geom_density(fill = "black", alpha = 0.2)+
xlim(0,24)+
labs(title="Overall Density Data Distribution ",x="Rating(1-5)",y="Density")+
theme(plot.title = element_text(hjust = 0.52))
#plot 3 summary 1 ## Question 3
Question: Are the specific regions affecting the flight arrival delays?
Answer:
Yes, specific areas like the west region and south region has more flight delays when compared to Northeast, Midwest, Southwest, Middle atlantic. There among the 5258 flights delayed 1752 are delyed in the west region and 806 from the south region.
avg_delay<-mean(flights_df$dep_delay)
avg_delay ##
[1] 9.070529
region_Cond<-delayed_data %>% filter(dep_delay>=avg_delay) %>% group_by(dest_airport_region) %>% summarise(number_of_delays_in_region=n(),Avg_arrival_delay_in_region=mean(arrival_delay))
region_Cond
# Uniform color
ggplot(region_Cond, aes(x = dest_airport_region, y = number_of_delays_in_region)) +
geom_col(position = "dodge") +
labs(title = "Flight dalays based on region",
x = "Region",
y = "Number of flights delayed")+
geom_text(
aes(label = number_of_delays_in_region),
colour = "red", size = 4,
vjust = 0.001, position = position_dodge(.95))
#####plot 4
## Question 4
**Question**
At what hour specifically the most of the flights are delayed?
**Solution**
Most of the flgihts are delayed in the evening hours and afternoon hours that is approximately between the 3 p.m to 9 p.m. and rest fo the hours are flgiht delays are little less comparitively from the below do-nut pie chart.
```r
flighttypes_df <- delayed_data %>%
select(dep_time) %>%
dplyr::mutate(flighttype = ifelse(dep_time <= 6, "Early_hours", ifelse(dep_time <= 12, "LateMorning", ifelse(dep_time <= 18, "Afternoon", "Evening")))) %>%
group_by(flighttype) %>%
dplyr::summarise(n = length(flighttype), .groups = 'keep') %>%
group_by(flighttype) %>%
mutate(percent_of_total = round(n*100/sum(n),1)) %>%
ungroup %>%
data.frame()
plot_ly(flighttypes_df, labels = ~flighttype, values = ~n) %>%
add_pie(hole=0.6) %>%
layout(title="Total Delays of Flights by Time of Day") %>%
layout(annotations=list(text=paste0("Total Flight Delay Count: \n",
scales::comma(sum(flighttypes_df$n))),
"showarrow"=F))
Are the distance travelled by the flight is affecting the flight arrival delays Question:
Yes, the distance travelled by the united airlines is more and we can see that the united airlines american and southwest data distributin is more, where we can say that the flight distance of travel is affecting the flight delay a bit. Answer:
distance_data <- select(delayed_data,distance, airline, arrival_delay)
##distance_data
airline_count<-distance_data %>% count(airline, name = 'Delayed_flight_count_of_airline',sort = TRUE)
airline_count
United_airlines <- filter(distance_data, airline == 'United')
United_airlines
ggplot(data = distance_data, mapping = aes(x = reorder(airline, distance,fun=median),
y = distance, fill = airline)) +
geom_violin() +
geom_jitter(width = 0.08, alpha = 0.6) +
ylim(0,2500)+
labs(title = "Violin Plot of hwy by class",
x = "airlines", y = "distance")
###summary 2
Question:
Are the delayed flgihts are more in number of specific airline and factors affecting that airlines more?
Answer: From the below summary we can see that specific airline like united airlines are 3115 and americam: 538, delta 330 are delayed for the delat airways the avg delay is around 74.9 and american has 66.6 and united around 64.0, similarly the arrival delay is also affected where 79.7, 74.1 and 69.6 respectively. They also have more number of wheels on time.
delayed_data_summary <-delayed_data %>% group_by(airline) %>% summarise(Number_of_delayed_flights = n(),
Avg_of_departure_delays=mean(dep_delay),
Avg_of_arrival_delays=mean(arrival_delay),
Avg_wheels_on=mean(wheels_on))
ontime_data_summary <-ontime_data %>% group_by(airline) %>% summarise(Number_of_delayed_flights = n(),
Avg_of_departure_delays=mean(dep_delay),
Avg_of_arrival_delays=mean(arrival_delay),
Avg_wheels_on=mean(wheels_on))
delayed_data_summary
ontime_data_summary
####summary 3
Are the air_port functioning like taxi_out, taxi_in, departure delays affetcing the delay of flights arrival?
Question: Yes we can see that the average departure delay from Fort Lauderable_hollywwod airport is more and the highest departure delays is around 420 and next is daniel k inouye with 112 min delay at the same time the taxi out time is 18 min and 16 min respectively at these airports.
Answer:
delayed_data %>% group_by(dest_airport_name) %>%
summarise(Avg_Departure_delays=mean(dep_delay),
Avg_wheels_on=max(wheels_on),
Min_taxi_out_time=min(taxi_out),
Min_taxi_in=min(taxi_in)) %>%
arrange(desc(Avg_Departure_delays))
Question: Are any specific week days are affecting the delay in arrival of the flight at destination ?
Answer:
We can see that the flight are more delayed in Thursday, friday and on the starting day of the week on monday there are around 893 delayed on thursday and 803 on frinday and monday around 811, from the summary we can see moday 63.6 avg delay and taxi_out is 27.9 similarly in monday it is 68.1 on monday.
delayed_week_summary <-delayed_data %>% group_by(weekday) %>% summarise(No_of_delayed_flights_days = n(),
Avg_of_departure_delays=mean(dep_delay),
Avg_of_taxi_in=mean(taxi_in),
Avg_taxi_out=mean(taxi_out)) %>% arrange(desc(No_of_delayed_flights_days))
delayed_week_summary
#plot 6
week_data <- select(delayed_data,weekday, airline, arrival_delay)
weekday_count<-week_data %>% count(weekday, name = 'Delayed_flight_count_of_airline',sort = TRUE)
weekday_count
ggplot(data = week_data, mapping = aes(x = arrival_delay , fill = weekday)) +
geom_histogram( color = "white", bins = 10) +
facet_wrap( ~ weekday, nrow = 1) +xlim(0,500)+
labs(title = "Distribution of Resting Blood Pressure",
x = "Resting Blood Pressure",
y = "Delayed_flight_count_of_airline")
ggplot(data = ontime_data, mapping = aes(x = arrival_delay , fill = weekday)) +
geom_histogram( color = "white", bins = 15) +
facet_wrap( ~ weekday, nrow = 1) +
labs(title = "Distribution of Resting Blood Pressure",
x = "Arrival delay",
y = "Delayed_flight_count_of_airline")
Write an executive summary of your overall findings and recommendations to the executives at Dulles Airport. Think of this section as your closing remarks of a presentation, where you summarize your key findings and make recommendations on flight operations and strategies for minimizing flight delays.
Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:
An introduction where you explain the business problem and goals of your data analysis
What problem(s) is this company trying to solve? Why are they important to their future success?
What was the goal of your analysis? What questions were you trying to answer and why do they matter?
Highlights and key findings from your Exploratory Data Analysis section
What were the interesting findings from your analysis and why are they important for the business?
This section is meant to establish the need for your recommendations in the following section
Your recommendations to the company
Each recommendation must be supported by your data analysis results
You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation
You must also describe the potential business impact of your recommendation:
Why is this a good recommendation?
What benefits will the business achieve?
Please write your executive summary below. If you prefer, you can type your summary in a text editor, such as Microsoft Word, and paste your final text here.
Introduction:
Flight delays are a severe issue that costs airlines, passengers, and the United States’ economy. A greater knowledge of how weather affects aircraft can aid in the development of forecasts and the reduction of the risk of flight delays.
Key features and findings:
Larger airlines are more likely to experience delays, whereas smaller and less popular airlines are less likely to experience delays. Larger carriers, such as United Airlines and American Airlines, experienced less delays than Delta and Southwest Airlines, suggesting that they may be a more trustworthy alternative when flying. According to the line plot, the days with the least delays were Monday, Tuesday, and Friday. These may be the days you prefer to travel on in the hopes of meeting the fewest delays possible. The most delays occur in the late morning and afternoon, according to the last three visualizations that assessed aircraft delays by airline and hour. Early in the morning, there are minimal delays, and later that evening and night, there are less delays. According to the visuals, flying during these times will shorten your journey and allow you to spend less time on the ground.
The frequency of the same airline flights must be evenly divided throughout the day.
Recommedation to the company:
The ground crew and air traffic employees of certain airlines, such as United, should be increased, since they will be in a better position to manage flights, such as boarding passengers, checking flight status quickly, and issuing a signal to fly, reducing aircraft taxi times. Increasing and training them will decrease the Taxi-in and taxi-out and departure delays. Airlines such as Fronteir Skywest Jet must enhance their flights in order to increase the pace and frequency with which they can be handled. The flights schedule must be changed with analyzing the time they are departed, so that the flight delays are minimized. From analysis we can recommend that the flight departure schedule can be changed and some of the flights can be scheduled during the 0 (12:00 a.m) - 5 a.m so that the spread of the flights departure is even round the clock which can show impact on flights arrival. West and south region must concentrate on departure delays to reduce the flight delays. Many airlines must prepare ahead for the month of December, and aircraft and ground crews must communicate well to reduce air traffic during that period.
If these recommendations the airline will decrese the frequent delays and increase the profits, this analysis would help increasing the airline business.